Comparative analysis of batch correction methods for FDG PET/CT using metabolic radiogenomic data of lung cancer patients

In radiomics research, the issue of different instruments being used is significant. In this study, we compared three correction methods to reduce the batch effects in radiogenomic data from fluorodeoxyglucose (FDG) PET/CT images of lung cancer patients. Texture features of the FDG PET/CT images and genomic data were retrospectively obtained. The features were corrected with different methods: phantom correction, ComBat method, and Limma method. Batch effects were estimated using three analytic tools: principal component analysis (PCA), the k-nearest neighbor batch effect test (kBET), and the silhouette score. Finally, the associations of features and gene mutations were compared between each correction method. Although the kBET rejection rate and silhouette score were lower in the phantom-corrected data than in the uncorrected data, a PCA plot showed a similar variance. ComBat and Limma methods provided correction with low batch effects, and there was no significant difference in the results of the two methods. In ComBat- and Limma-corrected data, more texture features exhibited a significant association with the TP53 mutation than in those in the phantom-corrected data. This study suggests that correction with ComBat or Limma methods can be more effective or equally as effective as the phantom method in reducing batch effects.


CancerSCAN
CancerSCAN is an NGS-based targeted-sequencing platform designed at our institution.The reliability of this assay was proved by robust analytic validation in previous studies, where the details of the experimental procedures were described [11][12][13] .CancerSCAN version 1 targets 83 genes, and version 2 targets 381 genes.The selected target genes for this customized platform were curated at the request of researchers and clinicians.These target genes were associated in the literature and public databases with targeted cancer therapies or therapy responses.Single nucleotide variants, small insertions/deletions, copy number variations, and gene fusions were detected

FDG PET/CT image analysis
Image feature extraction was based on a previous study and used the gradient-based segmentation method ('PET Edge') in MIM version 6.4 software (MIM Software Inc., Cleveland, OH, USA).The target tumor was identified by an experienced nuclear medicine physician (S.H.M.) who was unaware of all clinical information except the target tumor site.As the physician dragged the cursor from the center of the target tumor to a point near the edge of the lesion, six axes interactively extended.The length of each axis was restricted when a large gradient was detected along that axis.Then, the software automatically outlined a three-dimensional VOI on the tumor.After performing gradient-based segmentation of the target tumor lesion, we extracted PET image features using the Chang-Gung Image Texture Analysis toolbox (CGITA, https:// code.google.com/p/ cgita), an open-source software package implemented in MATLAB (version 2012a; MathWorks Inc., Natick, MA, USA).A total of 86 PET features available in CGITA were measured on each segment, and all the features were included for analysis.

Phantom correction
For phantom correction, FDG PET/CT images of a cylinder phantom (NEMA NU2-1994) were acquired by each instrument, Discovery LS and Discovery STE.Air, Teflon, and a hot cylinder were inserted.A hot cylinder phantom was filled by FDG solution with a concentration of 5 MBq/kg.The ratio of the hot cylinder to the background was set as 4:1.The texture parameters were obtained by the same method described above.Briefly, a VOI was drawn by the 'PET Edge' method on a hot cylinder, which was identified by an experienced nuclear medicine physician.The parameters of a drawn VOI were calculated by CGITA software.The ratio of each texture feature between the two instruments was calculated.The texture parameters of FDG PET/CT acquired from Discovery STE were corrected by the ratios calculated above.

Harmonization
As statistical batch-effect adjustment methods, we considered the 'removeBatchEffect' function in the Limma R package 14 and ComBat 15 , which were originally developed for expressions of RNA sequencing or microarrays, although they can be used for other data types.ComBat can make the mean and variance of the samples equal to either the global mean and variance, or those for the specified reference batch.Hereafter, the former is termed 'ComBat (Global), ' while those for the given batch are referred to as 'ComBat (Ref 1) or 'ComBat (Ref 2)' .'removeBatchEffect' in Limma R package employs a robust approach by incorporating batch information as a covariate within a linear modeling framework.It operates under the assumption that batch effects can be represented as linear additive effects.To address this, Limma fits a linear model to the data, including batch as a covariate.It then effectively mitigates batch effects by subtracting the estimated batch effect.
Finally, we assessed the existence of the batch effect by principal component analysis (PCA) plots, the rejection rates of the k-nearest neighbor batch effect test (kBET) method 16 , and silhouette scores 17 .If the principal component scores were clustered by batch, it may have suggested that the data were systemically inconsistent owing to the different batches.For subject i , let a i be the average distance to the other samples within the same batch and let b i be the minimum of the averaged distance from the samples in other batches.Then, the silhouette score, s i , is defined by where the larger value indicates the separation of batches, which leads to the batch effect.The kBET rejection rate considered the batch distribution of each sample's neighborhood, and it tested whether the proportion of the batch of k-nearest neighbors of randomly selected samples was similar to the global batch proportion using Pearson's χ 2 test.The number of nearest neighbors, k, was set to k = 10 , and the process was repeated 1000 times to calculate the rejection rate.A higher rejection rate meant that the proportions of local and global batches were significantly different.The means of the kBET rejection rate and silhouette scores with and without batch effect adjustment were compared with t-test method.www.nature.com/scientificreports/

Associative test between corrected features and gene mutations
The statistical association between each image feature and gene mutation was tested.For each gene, we checked the mutation, and the mutated gene was coded as '1' .Otherwise, it was coded as '0.' In order to provide sufficient statistical power in the study, if the minor mutation frequencies were less than 0.05, they were removed, and the remaining 50 genes were tested by using logistic regression.For each gene mutation (dependent variable) and each image feature (independent variable), a separate logistic regression model was built to analyze the association between the presence of a gene mutation and the value of an image feature.The significance level was set to 0.05 and the multiple testing problem was adjusted with the Benjamini-Hochberg method.All statistical analyses were performed using R software (v.4.0.4,R Foundation for Statistical Computing, Vienna, Austria).

Harmonization of FDG PET/CT image features
To identify the presence of the batch effect, the kBET rejection rate and silhouette score were calculated for image features before and after harmonization and conducting PCA.The PCA plots enabled evaluation of the distribution of data using a visual assessment.If the kBET rejection rate and silhouette score-parameters for measuring the batch effects-were closer to zero, it meant that there was a smaller batch effect between datasets.In Fig. 2, the PCA plots show substantial differences between batch 1 and batch 2 in uncorrected data and phantom-corrected data.For the uncorrected data and phantom-corrected data, subjects belonging to batch 1 data demonstrate greater dispersion with more outlying subjects.However, for ComBat-and Limma-corrected data, there are no significant differences between batch 1 and batch 2 (Fig. 2 upper panel).It is also shown that all harmonization methods lowered the kBET rejection rate and increased the absolute silhouette scores closer to zero (Fig. 2 lower panel).All p-values from the t-test comparing the results from the uncorrected data and corrected data are less than 0.05, which implies that the rejection rates and silhouette scores were significantly changed.

Association between corrected features and gene mutations
The p-values of logistic regression between the texture features and gene mutation profiles were plotted after adjusting the batch effects with ComBat (Global), ComBat (Ref 1), ComBat (Ref 2), Limma, and phantom methods (Fig. 3).The p-values from the phantom methods were compared with those from the other methods.The largest r 2 was from Combat (Global), and the smallest r 2 was from Limma (r 2 = 0.4528 for Combat, r 2 = 0.4325 for Limma).All significantly associated genes and image features in the phantom-corrected data were also significant in the data corrected by ComBat (Global).However, the reverse was not satisfied.There were two gene mutations demonstrating significant association with texture features, TP53 and IRS2.Three texture features-neighborhood intensity-difference coarseness, normalized co-occurrence entropy, and SUV statistics entropy-showed significant association with the TP53 mutation for all correction methods.Four texture parameters, co-occurrence contrast, co-occurrence dissimilarity, size-zone variability, and intensity variability showed significant association with the TP53 mutation only for ComBat and Limma methods in contrast to the phantom method (Table 1).The statistics are detailed in Table 2.

Discussion
In this study, the metabolic texture features of FDG PET/CT images from different instruments were unified by three correction methods: phantom correction, ComBat, and Limma.The uncorrected data demonstrated high batch effects compared to the corrected data.Although kBET rejection rate and silhouette score were lower in the  www.nature.com/scientificreports/phantom-corrected data than in the uncorrected data, the PCA plot showed similar variance.The ComBat and Limma methods provided correction with low batch effects, and there was no significant difference between the two methods.In ComBat-and Limma-corrected data, more texture features exhibited significant associations with the TP53 mutation than in the phantom-corrected data.
In practice, different PET/CT instruments from various vendors are utilized at different medical institutions and even within the same institute.Moreover, preparation of patients or image acquisition protocols differs according to the institute and region.There have been controversies whether there are significant differences in conventional image parameters, such as SUVmax, according to the instruments used 18,19 .Nevertheless, it is commonly acceptable to use conventional image parameters from different devices in clinical fields or studies owing to the overall similarity of examination procedures and image acquisition mechanisms.However, the instrument issue remains highly critical in the radiomics field.Previous studies suggested that radiomic features are sensitive to different acquisition methods and reconstruction parameters 20,21 .The heterogeneity of radiomic parameters between different instruments hinders employment of data from various scanners in clinical studies 22 .Therefore, pre-processing of radiomic data to remove batch effects is the most important step before conducting further analysis.
A phantom correction method is a conventional and basic approach to reduce the difference between each instrument.A cylinder phantom suggested by the National Electrical Manufacturers Association is widely used in performance test of PET/CT scanners 23,24 .As a measurement tool of the count rate or uniformity, it is used in quality control of scanners.Phantom correction has the advantage of identifying actual differences of image parameters from the measurement of a real phantom.However, it has two disadvantages.First, it is difficult in practice to acquire a reproducible correction coefficient between phantom images for various PET/CT scanners.Even in the same instrument, the measurement results may be changed each time according to different technicians or researchers.In addition, ambiguity exists in terms of which instrument is set as a reference if there are more than three scanners.Second, an appropriately heterogeneous phantom cannot be used.In real tumor tissue, there are few cases with high homogeneity along the whole tumor tissue.Nonetheless, a commonly used phantom can only provide a highly homogenic image owing to its simple structure.In the present study, the phantomcorrected data showed high variance in the PCA plot despite the low kBET rejection rate and silhouette score.Thus, phantom correction was deemed inferior to correction with ComBat and Limma methods with respect to practical limitations and clinical usefulness.It underscores the importance of integrating statistical methods There were more metabolic texture features that showed a significant association with the TP53 mutation in the ComBat-corrected data and Limma-corrected data than in the phantom-corrected data.This finding suggests that correction with ComBat or Limma methods can more sensitively detect a notable association in radiogenomics studies.The inherent batch effects in the uncorrected or even phantom-corrected data seem to have obscured true biological associations.Most texture parameters that showed a significant association in ComBat-corrected data overlapped with those in Limma-corrected data.Thus, ComBat and Limma methods is deemed to be comparable for utilization in downstream analysis.Nevertheless, it is difficult to absolutely define and verify if the result is false-negative or false-positive.Nygaard et al. suggested that methods removing batch effects, such as ComBat harmonization, may exaggerate the significance of downstream analyses 25 .With this consideration, it cannot be ignored that ComBat-corrected data may demonstrate false-positive texture parameters associated with the TP53 mutation (co-occurrence contrast, co-occurrence dissimilarity, etc.).Therefore, care should be taken in interpreting results and presuming biological meanings in radiogenomics research.Also, further analyses based on large-scale image data from a single scanner are warranted to evaluate the possibility of false positives.It is noteworthy that distinct patterns were observed post-correction with all methods, with certain texture features such as coarseness and entropy being consistently associated with TP53 mutation.This highlights the possibility that TP53 mutation may influence tumor heterogeneity, which is then reflected in radiomic features.By understanding these associations, insights can be gained into the image characteristics linked to genetic changes in lung cancer, thereby bridging the gap between tumor biology and its observable phenotype.Nevertheless, given the limited number of features displaying associations and the absence of external validation, it is challenging to conclusively determine a distinct trend between radiomic features and genetic profiles.Further research is warranted in this regard.
In the present study, both the Limma and ComBat methods were applied for harmonization.As an analytic tool, Limma was originally developed for genomic data such as RNA-sequencing or microarrays 14 .ComBat employs a Bayesian framework, assuming that batch effects can be modeled as shifts in location and scale.It estimates batch-specific parameters and adjust the data accordingly.On the other hand, Limma assumes under the assumption that batch effects can be represented as linear additive factors.It fits a linear model to the data, includes batch as a covariate, and then removes the estimated batch effect.As radiomics analysis provides diverse features from medical images, radiomics data and genomics data are common in high-dimensional data across individuals.The ComBat method has been widely used for harmonization of radiomic data 26 .There are a few studies applying the Limma method for radiomic data.A previous study utilized the Limma method for selection of differentially expressed radiomic features not for harmonization 27 .Another study found that radiomics models harmonized with Combat and Limma data were not different for predicting neoadjuvant chemotherapy efficacy in breast cancer 28 .It is supported by the present study that demonstrated good accordance of low batch effects and genomic data associations between results from ComBat and Limma methods.It is noteworthy that this study is the first study to investigate value of Limma method for harmonization of radiomic data from FDG PET/CT images in terms of association with genomic data.While batch effect correction methods such as ComBat and Limma are valuable tools for mitigating the impact of technical variation, they introduce the potential risk of over-or under-correction in radiomics data analysis.One limitation of ComBat and Limma is their underlying assumption of additive batch effects, which may not fully capture the complexity of batch effect structures inherent to radiomics data.Additionally, the challenge arises from the difficulty in accurately assessing the precise nature and extent of these batch effects, given the multifaceted and high-dimensional nature of radiomics data.Over-correction can lead to the unintended removal of genuine biological variation, diminishing the biological insights that can be gleaned from the data.This concern is further exacerbated by the intricate and multifaceted nature of biological variation in radiomics data, which often involves complex interactions between features and may not be fully disentangled by these methods.Consequently, researchers must exercise caution in interpreting results, as the challenge of striking the right balance between batch effect removal and preservation of biologically relevant signal remains an ongoing consideration in radiomics data analysis.
For a reliable and reproducible radiomic model, methodologic basis, such as autosegmentation, data processing, and correction is essential 25,26,29 .In addition, given the current trend creating novel and multidisciplinary approaches based on different types of clinical information and other characteristics, addressing the data heterogeneity is of utmost importance 30 .Especially in the realm of radiogenomic research, the criticality of addressing batch effects cannot be understated.Previous literature may have marginalized this issue, risking the obfuscation of genuine biological associations indispensable for precise clinical interpretation.This study elucidates respective efficacies of batch correction methods, highlighting the importance of data harmonization in radiomic studies.This focus on rigorous batch effect correction becomes even more crucial in the context of multi-center investigations where data variability is an inherent challenge.By ensuring that radiomic parameters robustly represent true tumor biology, it is possible to obtain more precise, reproducible, and clinically relevant findings.Therefore, the present study underscores the necessity for meticulous radiomic data preprocessing in clinical field and oncology research.Furthermore, it is anticipated to enhance the reliability of previous investigations pertaining to the prognostic value of radiomic features, thereby augmenting their potential applicability 31,32 .
This study had several limitations.First, there was no evaluation for reproducibility of the phantom correction method.We produced a reference ratio between two instruments from a single experiment.However, it is reasonably hypothesized that a reference ratio would not differ significantly if a cylinder phantom is filled with an FDG solution with the most homogeneous concentration.Furthermore, even if there is significant non-reproducibility in a phantom correction method, it would support our discussion to suggest the practical superiority of a harmonization method.Second, only two different instruments were used in this study.In multi-center trials, FDG PET/CT images from three or more scanners may be enrolled.A further study with more scanners would www.nature.com/scientificreports/more strongly support the substitutability of the harmonization method.Third, several patients were excluded due to different therapeutic history or quality control.Generally, excluding patients is considered a potential source of selection bias.However, incorporating genomic data from treated tissues may not be accurate, as they don't reflect inherent properties of untreated tumors.Using data without quality control can compromise study results, and including atypical lung cancers might limit the study's broader applicability.Thus, it is contended that our exclusions in fact enhance the study's reliability and generalizability.Fourth, despite of harmonization, there is an inherent limitation of heterogeneity stemming from target segmentation, and feature extraction algorithms 31 .While uniformity can be partially achieved in same methods of target segmentation or feature extraction software, aligning all conditions equally is a significant challenge.Finally, this study focused on the association between the texture parameters and the genetic characteristics of lung cancer.Although genomic data have been widely used in recent oncology fields, clinical outcomes, such as disease-free survival and overall survival, remain the most important factors in clinical practice and research.Further investigation is warranted to focus on the correlation between the corrected texture parameters and clinical outcomes in various cancer types beyond lung cancer.
In conclusion, ComBat-and Limma-corrected data showed fewer batch effects than phantom-corrected and uncorrected data.ComBat and Limma correction reduced batch effects with no significant difference between the two methods.In ComBat and Limma-corrected data, more texture features demonstrated a significant association with the TP53 mutation than those in phantom-corrected data.These findings suggest that correction with ComBat or Limma methods can be more effective or equally as effective at reducing batch effects than correction with the phantom method.Despite the possibility of false-positive findings, ComBat-corrected data or Limma-corrected data may be acceptable for use in further analyses considering their practical availability and results with comprehensible association with genetic characteristics.

Figure 2 .
Figure 2. PCA plots (top) and box plots of kBET rejection rates and silhouette scores (bottom) by each batch correction method.The p-values from the t-test compared with uncorrected data are noted in the top of each box plot.

Table 2 .
Significant results of logistic regression.a p-values are adjusted using the Benjamini-Hochberg method.